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Abstract — Graphics processors, or GPUs, have recently been widely used as accelerators in the shared environments such as clusters 
and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important 
metric for performance and total ownership cost. Despite the recently improved runtime support for concurrent GPU kernel executions, 
the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system with 
dynamic slicing and scheduling techniques to improve the throughput of concurrent kernel executions on the GPU. With slicing, Kernelet 
divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other 
slices and to fully utilize the GPU resources. We develop a novel and effective Markov chain based performance model to guide the 
scheduling decision. Our experimental results demonstrate up to 31 .1% and 23.4% performance improvement on NVIDIA Tesla C2050 
and GTX680 GPUs, respectively. 
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1 Introduction 

The graphics processing unit (or GPU) has become an ef- 
fective accelerator for a wide range of applications from 
computation-intensive applications (e.g., [25], [26J, |43|) 
to data-intensive applications (e.g, llOl , IITHl ). Compared 
with multicore CPUs, new-generation GPUs can have 
much higher computation power in terms of FLOPS 
and memory bandwidth. For example, an NVIDIA Tesla 
C2050 GPU can deliver the peak single precision float- 
ing point performance of over one Tera FLOPS, and 
memory bandwidth of 144 GB/s. Due to their immense 
computation power and memory bandwidth, GPUs have 
been integrated into clusters and cloud computing in- 
frastructures. In Top500 list of November 2012, two out 
of the top ten supercomputers are with GPUs integrated. 
Amazon and Penguin have provided virtual machines 
with GPUs. In both cluster and cloud environments, 
GPUs are often shared by many concurrent GPU pro- 
grams (or kernels) (most likely submitted by multiple 
users). Additionally, to enable sharing GPUs remotely, 
a number of software frameworks such as rCUDA [7] 
and V-GPU [45 1 have been developed. This paper studies 
whether and how we can improve the throughput of 
concurrent kernel executions on the GPU. 

Throughput is an important optimization metric for 
efficiency and the total ownership cost of GPUs in such 
shared environments. First, many GPGPU applications 
such as scientific and financial computing tasks are usu- 
ally throughput oriented (9]]. A high throughput leads 
to the high performance and productivity for users. 
Second, compared with CPUs, GPUs are still expensive 
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devices. Therefore, a high throughput not only means 
a high utilization on GPU resources but also the total 
ownership cost of running the application on the GPU. 
That might also be one of the reasons that GPUs are 
usually deployed and shared to handle many kernels 
from users. 

Recently, we have witnessed the success of GPGPU 
research. However, most studies focus on single-kernel 
optimizations (e.g., new data structures Ifl7l and GPU 
friendly computing patterns flOl , IfTTI ). Despite the fruit- 
ful research, a single kernel usually severely under- 
utilizes the GPU. This severe underutilization is mainly 
due to the inherent memory and computation behavior 
of a single kernel (e.g., irregular memory accesses and 
execution pipeline stalls). In our experiments, we have 
studied eight common kernels (details are presented 
in Section On C2050, their average IPC is 0.52, 
which is far from the optimal value (1.0). Their memory 
bandwidth utilization is only ranging from 0.02% to 14%. 

Recent GPU architectures like NVIDIA Fermi |27| ar- 
chitecture supports concurrent kernel executions, which 
allows multiple kernels to be executed on the GPU 
simultaneously if resources are allowed. In particular, 
Fermi adopts a form of cooperative kernel scheduling. 
Other kernels requesting the GPU must wait until the 
kernel occupying the GPU voluntarily yields control. 
Here, we use NVIDIA CUDA's terminology, simply be- 
cause CUDA is nowadays widely adopted in GPGPU 
applications. A kernel consists of multiple executions 
of thread blocks with the same program on different 
data, where the execution order of thread blocks is not 
defined. On the Fermi GPUs, one kernel can take the 
entire GPU as long as it has sufficient thread blocks 
to occupy all the multi-processors (even though it can 
have severely low resource utilization). Concurrent exe- 
cution of two such kernels almost degrades to sequential 
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execution on individual kernels. Recent studies |34| on 
scheduling the concurrent kernels mainly focus on the 
kernels with low occupancy (i.e., the thread blocks of 
a single kernel cannot fully utilize all GPU multipro- 
cessors). However, the occupancy of kernels (with large 
input data sizes in practice) is usually high after single- 
kernel optimizations. 

Individual kernels as a whole cannot realize the real 
sharing of the GPU resources. A natural question is: 
can we slice the kernel into small pieces and then 
co-schedule slices from different kernels in order to 
improve the GPU resource utilization? The answer is 
yes. One observation is that GPU kernels (e.g., those 
written in CUDA or OpenCL) conform to the SPMD 
(Single Program Multiple Data) execution model. In 
such data-parallel executions, a kernel execution can 
usually be divided into multiple slices, each consisting 
of multiple thread blocks. Slices can be viewed as low- 
occupancy kernels and can be executed simultaneously 
with slices from other kernels. The GPGPU concurrent 
kernel scheduling problem is thus converted to the slice 
scheduling problem. 

With slicing, we have two more issues to address. The 
first issue is on the slicing itself: what is the suitable 
slice size? How to perform the slicing in a transparent 
manner? The smallest granularity of a slice is one thread 
block, which can lead to significant runtime overhead of 
submitting too many such small slices onto the GPUs for 
execution. To the other extreme, the largest granularity 
of the slice is the entire kernel, which degrades to the 
non-sliced execution. The second issue is how to select 
the slices for co-schedule in order to maximize the GPU 
utilization. 

To address those issues, we develop Kernelet, a runtime 
system with dynamic slicing and scheduling techniques 
to improve the GPU utilization. Targeting at the concur- 
rent kernel executions in the shared GPU environment, 
Kernelet dynamically performs slicing on the kernels, 
and the slices are carefully designed with tunable oc- 
cupancy to allow slices from other kernels to utilize the 
GPU resources in a complementary way. For example, 
one slice utilizes the computation units and the other 
one on memory bandwidth. We develop a novel and ef- 
fective Markov chain based performance model to guide 
kernel slicing and scheduling in order to maximize the 
GPU resource utilization. Compared with existing GPU 
performance models which are limited to a single kernel 
only, our model are designed to handle heterogeneous 
workloads (i.e., slices from different kernels). We further 
develop a greedy co-scheduling algorithm to always co- 
schedule the slices from the two kernels with the highest 
performance gain according to our performance model. 

We have evaluated Kernelet on two latest GPU ar- 
chitectures (Tesla C2050 and GTX680). The GPU kernels 
under study have different memory and computational 
characteristics. Experimental results show that 1) our an- 
alytical model can accurately capture the performance of 
heterogeneous workloads on the GPU, 2) our scheduling 



increase the GPU throughput by up to 31.1% and 23.4% 
on C2050 and GTX680, respectively. 

Organization. The rest of the paper is organized as 
follows. We introduce the background and definition of 
our problem in Section [2] Section [3] presents the system 
overview, followed by detailed design and implementa- 
tion in Section |U The experimental results are presented 
in Section [5] We review the related work in Section [6] and 
conclude this paper in Section 

2 Background and Problem Definition 

In this section, we briefly introduce the background 
on GPU architectures, and next present our problem 
definition. 

2.1 GPU Architectures 

GPUs have rapidly evolved into a powerful accel- 
erator for many applications, especially after CUDA 
was released by NVIDIA [30]. The top tags in 
http: / / gpgp u.org/ 1 show that a wide range of appli- 
cations have been accelerated with GPGPU techniques, 
including vision and graphics, image processing, linear 
algebra, molecular dynamics, physics simulation and 
scientific computing etc. Those applications cover quite 
a wide range of computation and memory intensiveness. 
In the shared environment like clusters and cloud, it 
is very likely that users submit kernels from different 
applications to the same GPU. Thus, it is feasible to 
schedule kernels with different memory and computa- 
tion characteristics to better utilize GPU resources. 

This paper focuses on the design and implementation 
with NVIDIA CUDA. Since OpenCL and CUDA have 
very similar designs, our design and implementation 
can be extended to OpenCL with little modification. 
Kernelet takes advantage of concurrent kernel execution 
capability of new-generation GPUs like NVIDIA Fermi 
GPUs. With the introduction of CUDA, a GPU can be 
viewed as a many-core processor with a set of streaming 
multi-processors (SM). Each SM has a set of scalar cores, 
which executes the instructions in the SIMD (Single 
Instruction Multiple Data) manner. The SMs are in turn 
executed in the SPMD manner. The program is called 
kernel. 

In CUDA's abstraction, GPU threads are organized in a 
hierarchical configuration: usually 32 threads are firstly 
grouped into a warp; warps are further grouped into 
thread blocks. The CUDA runtime performs mapping 
and scheduling at the granularity of thread blocks. Each 
thread block is mapped and scheduled on an SM, and 
cannot be split among multiple SMs. Once a thread block 
is scheduled, its warps become active on the SM. Warp is 
the smallest scheduling unit on the GPU. Each SM uses 
one or more warp schedulers to issue instructions of the 
active warps. Another important resource of the GPU 
is shared memory, which is a small piece of scratchpad 
memory at the scope of a thread block. It is small and 
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Fig. 1: Application scenarios of concurrent kernel execu- 
tions on the GPUs. 



has very low latency. Shared memory is visible for all 
the threads in the same thread block. 

We define the SM occupancy as the ratio of active 
warps to the maximum active warps that are allowed to 
run on the SM. Higher occupancy means higher thread 
parallelism. The aggregated register and shared memory 
usage of all warps should not exceed the total amount 
of available registers and shared memory on an SM. 

Block Scheduling. CUDA runtime system maps 
thread blocks to SMs in a round-robin manner. If the 
number of thread blocks in a kernel is less than the 
number of SMs on the GPU, each thread block will be 
executed on a dedicated SM; otherwise, multiple thread 
blocks will be mapped to the same SM. The number of 
thread blocks that can be concurrently executed on the SM 
depends on their total resource requirements (in terms 
of registers and shared memory). If the kernel currently 
being executed on the GPU cannot fully utilize the GPU, 
the GPU allows to schedule thread blocks from other 
kernels for execution. 

GPU Code Compilation. In the process of CUDA 
compilation, the compiler first compiles the CUDA C 
code to PTX (Parallel Thread Execution) code. PTX is 
a virtual machine assembly language and offers a sta- 
ble programming model for evolving hardware archi- 
tectures. PTX code is further compiled to native GPU 
instructions (SASS). The GPU can only execute the SASS 
code. CUDA executables and libraries usually provide 
either PTX code or SASS code or both. In the shared 
environments, the source code is usually not available 
for kernel scheduling upon users submit the kernels. 
Thus, Kernelet should be designed to work on both PTX 
and SASS code. 



2.2 Problem Definition 

Application scenario. We consider two typical applica- 
tion scenarios in the shared environments as shown in 
Figure [TJ One is sharing the GPUs among multiple ten- 
ants in the virtualized environment (e.g., cloud). As illus- 
trated in Figure [Taj there is usually a GPU virtualization 
layer integrated with the hypervisor. Figure [lb] shows the 



other scenario, in which GPU servers offer API reception 
softwares (like rCUDA Q, |45l ) to support local/remote 
CUDA kernel launches. In both scenarios, the GPU faces 
multiple pending kernel launch requests. Kernelet can be 
applied to schedule those pending kernels. 

Our study mainly considers the throughput issues of 
sharing a single GPU. Kernelet can be extended to multi- 
ple GPUs with a workload dispatcher to each individual 
GPU. 

We have made the following assumptions on the 
kernels. 

1. We target at the concurrent kernel executions on 
the shared GPU. The kernels are usually through- 
put oriented, with flexibility on the response time 
for scheduling. Still, we do not assume the a priori 
knowledge of the order of the kernel arrival. 

2. Thread blocks in a kernel are independent with 
each other. This assumption is mostly true for the 
GPGPU kernels due to SPMD programming model. 
Most kernels in NVIDIA SDK and benchmarks like 
Parboil fl3l do not have dependency among thread 
blocks in the same kernel. This assumption ensures 
that our slicing technique on a given kernel is safe. 
The data dependency among thread blocks can be 
identified with standard static or dynamic program 
analysis. 

We formally define the terminology in Kernelet. 

Kernel. A kernel JC consists of k thread blocks with 
IDs, 0, 1, 2, (k - 1). 

Slice. A slice is a subset of the thread blocks of a 
launched kernel. Block IDs of a slice is continuous in 
the grid index space. The size of a slice s is defined as 
the number of thread blocks contained in the slice. 

Slicing plan. Given a kernel JC, a slicing plan S(1C) is 
a scheme slicing JC into a sequence of n slices (sq, s\, S2, 
s n _i). We denote the slicing plan to be JC=sq, s±, S2, 
s n _i. 

Co-schedule. Co-schedule cs defines concurrent exe- 
cution of n (n > 1) slices, denoted as so, s n _i. All the 
n slices are active on the GPU. 

Scheduling plan. Given a set of n kernels JCq, JC\, 
JC n , a scheduling plan C (cs , cs\, cs„_i) determines 
a sequence of co-schedules in their order of execution. 
csi is launched before csj if i < j. All thread blocks of 
the n kernels occur in one of the co-schedules once and 
only once. A scheduling plan embodies a slicing plan for 
each kernel. 

We define the performance benefit of co-scheduling n 
kernels to be the co-scheduling profit (CP) in Eq. {l). 
IPCi and cIPCi are IPC (Instruction Per Cycle) for 
sequential execution and concurrent execution of kernel i 
respectively. Our definition is similar to those in the pre- 
vious studies on CPU multi-threaded co-scheduling [20 1, 

El. 

CP = 1 - -s-^ (1) 

v-^ cIPCi 

iPd 
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Problem definition. Given a set of kernels for execu- 
tion, the problem is to determine the optimal scheduling 
plan (and slicing) so that the total execution time for 
those kernels is minimized. That corresponds to the 
maximized throughput. Given a set of n kernels /Co, 
K\, /C n _i, we aim at finding the optimal scheduling 
plan C for a minimized total execution time of C(Sa(K, ), 
<Si(/Ci), <S„-i(/C ri _i)), or maximized co-scheduling 
profit. Note, in the shared GPU environment, the arrival 
of new kernels trigger the recalculation of the optimiza- 
tion on the kernel residuals and new kernels. 
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Fig. 2: Design overview of Kernelet. 



3 System Overview 

In this section, we present the rationales on the Kernelet 
design, followed by an overview of the Kernelet runtime. 

3.1 Design Rationale 

Since kernels are submitted in an ad-hoc manner in our 
application scenarios, the scheduling decision has to be 
made at real time. The optimization process should take 
the newly arrived kernels into consideration. Moreover, 
our runtime system of slicing and scheduling should be 
designed with light overhead. That is, the overhead of 
slicing and scheduling should be small compared with 
their performance gain. 

Unfortunately, finding the optimal slicing plans and 
scheduling plan is a challenging task. The solution space 
for such candidate plans is large. For slicing a kernel, we 
have the factors including the number of slices as well 
as the slice size. For scheduling a set of slices, we can 
generate different scheduling plan with co-scheduling 
slices from different kernels. All those factors are added 
up into a large solution space. Considering newly arrival 
kernels makes the huge scheduling space even larger. 

Due to the real-time decision making and light-weight 
requirements, it is impossible to search the entire space 
to get a globally optimal solution. The classic Monte 
Carlo simulation methods are not feasible because they 
usually exceed our budget on the runtime overhead and 
violate real-time decision making requirement. There 
must be a more elegant compromise between the op- 
timality and runtime efficiency. Our solution to the 
scheduling problem is detailed in Section l4~2l 

Given the complexity of dynamic slicing and schedul- 
ing in concurrent kernel executions, we make the follow- 
ing considerations. 

First, the scheduling considers two kernels only. Previ- 
ous studies l20l on the CPU have shown that when there 
are more than two co-running jobs, finding the optimal 
scheduling plan becomes an NP-complete problem even 
ignoring the job length differences and rescheduling. 
Following previous studies [20 1, l34l , we make our 
scheduling decision co-scheduling two kernels only. 

Second, once we choose two kernels to schedule, their 
slice sizes keep unchanged until either kernel finishes. 
The suitable slice size is determined according to our 
performance model l4~4l 



3.2 System Overview 

We develop Kernelet as a runtime system to generate 
the slicing plan and scheduling plans for the optimized 
throughput. 

Figure |2] shows an overview of Kernelet. Kernels 
are submitted and are temporarily buffered in a ker- 
nel queue for further scheduling. Usually, a kernel is 
submitted in the form of binary (SASS) or PTX code. 
Submitted kernels are first preprocessed by kernel slicer 
to determine the smallest slice size for a given overhead 
limit. If the kernel has been submitted before, we simply 
use the smallest slice size in the previous execution. Ker- 
nelet's scheduler determines the scheduling plan based 
on the results of performance model, which estimates 
the performance of slices from two different kernels 
in a probabilistic manner. Once the scheduling plan is 
obtained, slices are dispatched for execution on the GPU. 
We will describe the detailed design and implementation 
of each component in the next section. 

4 Kernelet Methodology 

In this section, we first describe our kernel slicing mecha- 
nism. Next, we present our greedy scheduling algorithm, 
followed by our pruning techniques on the co-schedule 
space and the description of the performance model. 

4.1 Kernel Slicing 

The purpose of kernel slicing is to divide a kernel into 
multiple slices so that the finer granularity of each slice 
as a kernel in the scheduling can create more opportu- 
nities for time sharing. Moreover, we need to determine 
the suitable slice size for minimizing the slicing overhead 
(i.e., between the total execution time of all slices and the 
kernel execution time) . Particularly, we experimentally 
determine the suitable slice size to be the minimum slice 
so that the overhead is not greater than p% of the kernel 
execution time. In this study, p% is set to be 2% by 
default. We focus on the implementation of slicing on 
PTX or SASS code. Note that, warps within the same 
thread block usually have data dependency with each 
other, e.g., with the usage of shared memory. That is 
why we choose thread blocks as the units for slicing, 
instead of warps. 
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void Ma trixAdd( float *MatrixA, float "MatrixB, int 
width){ 

int row = blockID_X*blockDim_X+threadID_X-, 
int col = blockID_Y*blockDim_Y+threadID_Y; 
/* Each thread process one element */ 
A[row+col*width] += B[row+col*width]; 



/* Grid dimension: 16x16 */ 

dim3 gridConf( 16,16); 

/* Block dimension: 16x16 */ 

dim3 blockConf(i6,i6); 

MatrixAdd<«gridConf, blockConf»>(...); 



(a) Each thread access the cor- (b) Launch the same number 
responding matrix element us- threads as the number of ma- 
ing block and thread indices. (- r j x elements. 



void MatrixAdd(float *MatrixA, float *MatrixB, int 
width, dim3 blockOffset, dim3 gridConf){ 
/* Compute rectified indices*/ 
int rBlockID_X - bIockID_X+blockOffset.x; 
int rBlockID_Y - blocklDJY+blockOffset.y; 
/* Process carry bit*/ 
while (rBlockID_X > gridConf.x)( 
rBIocklDX — gridConf.x; 
rBlockID_Y ++; 

} 

/* Replace all subsequent access to b!ockID_X 
(blockID_Y) with rBlockID_X (rBlockID_Y) */ 



/* Sliced grid dimension: 8x1 */ 
dim3 sGridConf(8,i); 
dim3 blockOffset - (0,0); 
while(blockOffset.x < gridConf.x && \ 

blockOffset.y < gridConf.y)) 
MatrixAdd«<sGridConf, blockConf>»(.. 
blockOffset, gridConf); 

blockOffset.x +- sliceGrid.x; 
while (blockOffset.x > gridConf.x){ 
blockOffset.x -= gridConf.x; 
blockOffset.y ++; 

) 

} 



(c) Sliced kernel with rectified 
thread block indices. 



(d) Launch all slices of a kernel 
in a loop. 



Fig. 3: An example of kernel slicing. 



requires a single scan on the input code and the runtime 
overhead is negligible. 

4.2 Scheduling 

According to our design rationales, our scheduling de- 
cision is made on the basis of two kernels, to avoid 
the complexity of scheduling three or more kernels as 
a whole; the slice sizes of the two kernels are tuned 
for GPU utilization so that their execution times in the 
concurrent kernel execution are close. Thus, we develop 
a greedy scheduling algorithm, as shown in Algorithm [T] 
The scheduling algorithm considers new arrival kernels 
in Lines 2-3 in the main algorithm. The main procedure 
calls the procedure FindCoSchedule to obtain the optimal 
co-schedule in Line 5. The co-schedule is represented 
in four parameters < tC\, IC2, sizei, size2 >, where /Ci 
and IC2 denotes the two selected kernels; sizei and size 2 
represents the slice sizes accordingly. We use the same 
co-schedule if the kernels pending for execution do not 
change, or both kernels still have thread blocks. 



Figure |3] illustrates the sliced execution of Matrix Add 
with pseudo code. In the example, MatrixAdd is a GPU 
kernel to add two 256 x 256 matrices and each thread 
adds one element pair from the input matrices. Based on 
the matrix size, MatrixAdd is configured to launch with 
16 x 16 thread blocks in total and each block has 16 x 16 
threads. Figures |3a] and [3b] illustrate the definition and 
launch of the original kernel, respectively. In comparison, 
the sliced version of the kernel launches a slice with 8 
thread blocks each time. The built-in thread block indices 
(denoted as blockID_X and blockID_Y) of the sliced 
kernel are in a smaller index space ({(x,0)|0 < x < 8}) 
compared with the original index space ({(x,y)\Q < x < 
16 and < y < 16}). To make the slices execute as 
individual kernels, we apply a procedure called index 
rectification. As shown in Figure [3cJ we add an offset 
value to the thread block indices and obtain the rectified 
index values. The rectified indices are used to replace all 
subsequent accesses to the built-in indices. On the CPU 
side, we launch the slices in a loop and adjust the offset 
values for each slice launch (Figure [3d). 

Kernelet automatically implements slice rectification 
without any user intervention. With PTX or SASS code 
as input, Kernelet does not require source code. Kernelet 
interprets and modifies the PTX/SASS code at runtime. 
The resulting PTX code is compiled to GPU executables 
by the GPU driver, and the SASS code is assembled using 
the open source Fermi assembler Asfermi [1J. Kernelet 
stores the rectified block index in registers, and replaces 
all references to those built-in variables with the new 
registers. Since this process may use more registers. 
Kernelet tries to minimize the register usage by adopting 
the classic register minimization techniques JH), 1351 , e.g., 
variable liveness analysis. With register optimizations, 
register usage by slicing keeps unchanged in most of 
our test cases in experiments. Note, kernel slicing only 



Algorithm 1 Scheduling algorithm of Kernelet 

l: Denote 1Z to be the set of kernels pending for execu- 
tions; 

2: 
3 



if A new kernel JC comes then 
Add K into 11; 
4: while 7\L!=null do 

5: < JCi, IC2, sizei, size 2 >=FindCoSchedule(lZ); 
6: Denote the co-schedule to be c; 
7: Execute c on the GPU; 

8: while 1Z does not change, or fCi and both still 

have thread blocks do 
9: Generate co-schedule according to c and execute 
it on the GPU; 
Proc. FindCoSchedule(lZ) 

Function: generate the optimal co-schedule from 
K. 

l: Generate the candidate space for co-schedules C; 

2: Perform pruning on C according to the computation 
and memory characteristics of input kernels; 

3: Apply the performance model (Section l4.4|l to com- 
pute CP for all the co-schedule in C; 

4: Obtain the optimal co-schedule with the maximized 
CP; 

5: Return the result co-schedule; 

In Procedure FindCoSchedule, we first consider the 
entire candidate space consisting of co-schedules on 
pair-wise kernel combinations. Because the space may 



consider of 



N(N-l) 



co-schedules (N is the number of 



kernels for consideration), it is desirable to reduce the 
search space. Therefore, we perform pruning according 
to the computation and memory characteristics of input 
kernels. We present the details of pruning mechanisms 
in Section 14.31 After pruning, we apply the perfor- 
mance model (Section to estimate the CP for all co- 
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schedules, and pick the one with the maximized CP for 
executing on the GPU. 

4.3 Co-scheduling Space Pruning 

Given a set of co-schedules as input, we aim at devel- 
oping pruning techniques to remove the co-schedules 
that are not "promising" to deliver performance gain. So 
that the overhead of running the performance model is 
avoided. The basic idea to identify the key performance 
factor of a single kernel that affects the throughput of 
concurrent kernel executions on the GPU. 

There are many factors affecting the GPU perfor- 
mance. According to the CUD A profiler, there are around 
100 profiler counters and statistics for the performance 
tuning. For an effective scheduling algorithm, there is no 
way of considering all these factors. Following a previ- 
ous work l24l , we use the regression model to explore 
the correlation between the above mentioned factors and 
CP. The input values are obtained from single kernel 
executions. Through detailed performance studies, we 
find that instruction throughput and memory bandwidth 
utilization are the most correlated performance factors 
with co-scheduling friendliness. We define PUR (Pipeline 
Utilization Ratio) and MUR (Memory-bandwidth Uti- 
lization Ratio) to characterize user submitted kernels. 
High PUR means the instruction pipeline is highly uti- 
lized and there is little room for performance improve- 
ment. High MUR means a large number of memory 
requests are being processed and memory latency is 
high. PUR and MUR is calculated as follows, 

pjjj^ Instruction_Executed 

Time x Frequency x Peak_IPC 

MUR Dram_Reads + Dram_Writes 
Time x Frequency x Peak_MPC 

PeaklPC and Peak_MPC represent peak number of 
instructions and memory requests per cycle respectively. 
Instruction_Executed is the total number of instructions 
executed. Dram_Reads and Dram_Writes are the num- 
bers of read and write requests to DRAM respectively. 

We build a set of testing kernels to demonstrate the 
correlation between PUR/MUR and CP. A testing kernel 
is a mixture of memory and computation instructions. 
We tune the respective instruction ratios to obtain ker- 
nels with different memory and computational charac- 
teristics. The single kernel execution PURs and MURs 
of the testing kernels are in the range of [0.26, 0.83] and 
[0.07, 0.84] respectively. Figure S] shows the strong cor- 
relation between MUR/ PUR and CP. The observation 
conforms to our expectation. First, if one kernel has high 
PUR while the other kernel has low PUR, the former 
kernel is able to utilize the idle cycles exposed by the 
latter kernel when co-scheduled. Second, co-scheduling 
kernels with complementary memory requirements (one 
kernel has low MUR and the other kernel has high MUR) 




(a) MUR and CP (b) PUR and CP 

Fig. 4: Correlation between MUR/PUR and CP 



will alleviate memory contention and reduce idle cycles 
exposed by long latency memory operations. 

In summary, our pruning rule is to remove the co- 
schedules where the two kernels have close PUR or MUR 
values. We set two threshold values a p and a m for PUR 
and MUR, respectively. That means, we prune the co- 
schedule if the two kernels have PUR difference lower 
than a p , or have MUR difference lower than a m . Note, if 
all the co-schedules are pruned, we need to increase a p 
or a m . We experimentally evaluate the impact of those 
two threshold values in Section [5] 

4.4 Performance Model 

We need a performance model for two purposes: firstly, 
to select the two kernels for co-schedule; secondly, to 
determine the number of thread blocks for each kernel 
in the co-schedule (i.e., the slice size). Previous perfor- 
mance models on the GPU Il9l . 1551 , |44| assume a single 
kernel on the GPU, and are not applicable to concurrent 
kernel executions. They generally assume that the thread 
blocks execute the same instruction in a round-robin 
manner on an SM. However, this is no longer true on 
concurrent kernel executions. The thread blocks from dif- 
ferent kernels have interleaving executions, which cause 
non-determinism on the instruction execution flow. It is 
not feasible to statically predict the interleaving warp ex- 
ecutions for multi-kernel executions. To capture the non- 
determinism, we develop a probabilistic performance 
model to estimate the performance of co-schedule. Our 
cost model has very low runtime overhead, because it 
uses a series of simple parameters as input and leverages 
the Markov chain theory to get the performance of 
concurrent kernel executions. 

Table [I] summarizes the notations used for our perfor- 
mance model. 

Since the GPU adopts SPMD model, we use the perfor- 
mance estimation of one SM to represent the aggregate 
performance of all SMs on the GPU. We model the pro- 
cess of kernel instruction issuing as a stochastic process 
and devise a set of states for an SM during execution. 
By modeling the SM state, we first develop our Markov 
chain based models for single-kernel executions (homo- 
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TABLE 1: Parameters and notations in the performance 
model 



Para. 


Description 


W 


Maximum number of active warps 


Round 


A warp scheduling cycle that all ready warps 
are served by the warp scheduler 


Rm 


Memory instruction ratio 


Pr—±r 


Probability that a ready warp remains ready 


Pr—ti 


Probability that a ready warp transits to idle 


Pi — >r 


Probability that an idle warp transits to ready 


P%—^i 


Probability that an idle warp remains in idle 


N r —!). r 


Number of ready warps that remain ready 


N r -,i 


Number of ready warps that transit to idle 


Ni^r 


Number of idle warps that transit to ready 


Ni^i 


Number of idle warps that remain idle 


L 


Average memory latency (cycle) 


B 


GPU global memory bandwidth (requests /cy- 
cle) 


Sr 


Si corresponds the state where i warps are idle 
on the SM (i = 0, 1, . . . , W) 


P 


P is the Markov chain transit matrix. Entry 
Pij of P represents the probability of transiting 
from state Si to state Sj. 



geneous workloads), and then extend it to concurrent 
kernel executions (heterogeneous workloads). 

For presentation clarity, we begin with our description 
on the model with the following assumptions, and relax 
those assumptions at the end of this section. First, we 
assume that all the memory requests are coalesced. 
This is the best case for memory performance. We will 
relax this assumption by considering both coalesced and 
uncoalesced memory accesses. Second, we assume that 
the GPU has a single warp scheduler. We will extend it 
to the GPU with multiple warp schedulers. 

Homogeneous Workloads. We first investigate the 
performance of a single kernel executed on the GPU and 
each SM accommodates W active warps at most. 

A warp can be in two states: idle or ready. An idle 
warp is stalled by memory accesses, and a ready warp 
has one instruction ready for execution. Its transition is 
illustrated in Figure |5] When a warp is currently in the 
ready state, we have two cases for state transitions by 
definition: 

• remaining in the ready state with the probability of 

P r ^ r 1 Rm- 

• transiting to the idle state with the probability of 

Pr—¥i — Prw 

When a warp is currently in the idle state, we also 
have two cases for state transitions: 

« transiting to the ready state with the probability of 
Pi^>r = l = ^ 7 1 1 where I is the number of idle 

W — I 

warps on the SM. 

• remaining in the idle state with the probability of 

Pi^i = 1 — Pi— >r- 

We specifically define the time step and state tran- 
sition of the Markov chain model to capture the GPU 
architectural features. GPU adopts a round-robin style 
scheduling (19ll . In each round, the warp scheduler polls 
each warp to issue its ready instructions so all ready 




Fig. 5: Warp state transition diagram. 



warps can make progress. We model the SM state with 
the number of idle warps. We denote Si to be the SM 
state where i warps are idle on the SM (i = 0, 1, . . . , W). 
Thus, we consider the state change of the SM in one 
round and use round as time step in our Markov chain 
model. In each round, every ready warp has an equal 
chance to issue instructions. In contrast, models for the 
CPU assume that the CPU will keep executing one 
thread until this thread is suspended. 

We use IPC to represent the throughput of the SM. 
Thus, the number of idle warps on the SM is a key 
parameter for IPC . Thus, we define the state of SM as 
the number of idle warps on the SM (i.e., the state is Si 
when the number of idle warps is i). More outstanding 
memory requests usually lead to higher latency because 
of memory contention [3]. We adopt a linear memory 
model to account for the memory contention effects. We 
calculate L as L = Lq + a s + bo, where ao and bo are 
the constant parameters in the linear model. We obtain 
I/o and B according to hardware specifications. 

For homogeneous workload, the probabilities of state 
transitions are the same for all ready warps in a round. 
We assume when SM transits from 5, to Sj, N^ r idle 
warps transit to the ready state and N r ^i ready warps 
transit to idle state. The following conditions hold by 
definition. 

f < < Si 

I < N r -n <W-Si (2) 

[N r -n - Ni^. = Sj - Si 

With those constraints, there are multiple possible 
transitions to transit from Si to Sj. Since the possible 
transitions are mutually exclusive events, the probability 
of each state transition is calculated as the sum of the 
probabilities of all possible transitions. With all entries 
of the transition matrix P obtained, we can calculate 
the steady-state vector of the Markov chain. This is 
done by finding the eigenvector 7r corresponding to the 
eigenvalue one for matrix P l3ll . 

Ti" = (7o,7i>-»7w) (3) 

In Equation FJ) 7* is the probabilities that the SM stays 
in state Si in each round, i.e., the probability there are i 
idle warps in one round. The duration of the time step is 
(W — i) cycles since each of the (W—i) ready warps issue 
one instruction within the round. In the case i = W , the 
round duration is one, indicating no warp is ready and 
the SM experiences an idle cycle. Hence, the estimated 
IPC is the ratio of non-idle cycles given in equation HJ 



s 



where Ei=o 1 7i(^ — is the total non-idle cycles and 
jw is the total idle cycles. 



IPCk 



(4) 



Heterogeneous Workloads. When there are multiple 
kernels running concurrently the model needs to keep 
track of the state of each workload. Although we only 
consider two concurrent kernels {JC\ and K2) in schedul- 
ing, our model can be used to handle more than two 
kernels. 

Suppose there are two kernels JC\ and K2, and /Ci has 
w\ active warps and K,% has u>2 active warps {w\ +W2 = 
W). The number of possible states of the SM will be (wi+ 
1) x (ii>2 + 1). The state space is represented as a pair (p, q) 
with < p < wi and < q < IV2, where p and q are the 
numbers of idle warps of /Ci and IC2 respectively. We can 
calculate the probability of transiting from state (pi , qi' ) 
to state (pj , qy ) by first considering individual workload 
state transition probability using the single kernel model, 
and then calculating the SM state transition probability. 
The state transitions of different kernels are independent 
with each other, because the kernels are independent. 
Then the SM state transition probability is the product 
of the individual transition probabilities. 

With Markov chain approach, we obtain the steady 
state vector tt = {7(0,0), 7(o,i), ■■■■> l( Wl .w 2 )}- Next, we can 
obtain the IPC of each workload using the same method 
as the model in single-kernel executions, except the 
parameters are defined and calculated in the context of 
two kernels. For example, the round duration is equal 
to the total number of ready warps of both kernels. 
Individual IPCs of JC\ and JC2 is calculated as the ratio 
of non-idle cycles for each workload, as shown in Eq. © 
and |(6), respectively. The concurrent IPC is the sum of 
individual IPCs Eq. 10. Then CP can be obtained using 
Eq. ©. 



IPC, 



ICi 



IPC, 



k: 2 



l= 2^j=0 7(i,j) X H (i,j) 

t=0 T,i'=0 7(M') x{w 2 -i) 
J27=o Th>=o x R (i,i') 



C = IPC, 



Ki 



IPC K 



(5) 

(6) 
(7) 



With the estimated IPC and CP, we now discuss how 
to estimate the optimal slice size ratio for two kernels. 
We define the slice ratio which minimizes the execution 
time difference of co-scheduled slices as the balanced 
slice ratio. By minimizing the execution time difference, 
the kernel-level parallelism is maximized. The execution 
time difference is calculated as AT in Eq. {S). 

AT = Ixpfe- x x flci - jpbr 2 x Ik -> x Pk -2 I (8) 

I/Cf and P/Ci represent the number of instruction per 
block and the slice size of kernel /Q (i = 1 , 2) in number 
of thread blocks. Since Pfd is less than the maximal 



number of active thread blocks, only a limited number 
of slice ratios need to be evaluated to get the balanced 
ratio. 

Uncoalesced Access. So far, we assume that all mem- 
ory accesses are coalesced and each memory instruc- 
tion results in the same number of memory requests. 
However, due to the different address patterns, memory 
instructions may result in a different amount of memory 
requests. On Fermi GPUs, one memory instruction can 
generate 1 to 32 memory requests. Here we consider 
the two most common access patterns: fully coalesced 
access, and fully uncoalesced access. We extend our 
model to handle both coalesced and uncoalesced ac- 
cesses by defining three states for a warp: ready, stalled 
on coalesced access (uncoalesced idle), and stalled on 
uncoalesced access (coalesced idle). The memory opera- 
tion latency depends on the memory access type. Since 
uncoalesced access generates more memory traffic, its 
latency is higher than coalesced access. We also use the 
linear model to estimate the latency. By identifying the 
ratio of coalesced and uncoalesced memory instructions, 
we can easily extend the two-state model to handle 
three states and their state transitions. The Markov chain 
performance model can be developed in a similar way. 
Distinguishing between coalesced and uncoalesced ac- 
cesses increases the accuracy of our model. 

Adaptation to GPUs with multiple warp schedulers. 
Our model assumes there is only one warp scheduler. 
New-generation GPUs can support more than one warp 
schedulers. The latest Kepler GPU features four warp 
schedulers per SMX (SMX is the Kepler terminology 
for SM) [28J. We extend our model to handle this case 
by deriving a single pipeline virtual SM based on the 
parameters of the SMX. The virtual SM has one warp 
scheduler, and its parameters such as active thread 
blocks and memory bandwidth are obtained by dividing 
the corresponding parameters of the SMX by the number 
of warp schedulers. This virtual SM can still capture the 
memory and computation features of a kernel running 
on the SMX. Experimental results in Section [5] show 
that performance modeling on the virtual SM provides 
a good estimation on the Kepler architectures. 

There are two more issues that are worthwhile to 
discuss. 

The first issue is on the efficiency of executing our 
model at runtime. We have developed mechanisms to 
make our model more efficient without significantly 
sacrificing the model accuracy. The 0(N 3 ) complexity 
of calculating the steady state in Markov chain makes it 
hard to meet the online requirement (N is the dimension 
of the transition matrix). To reduce the computational 
complexity, we consider the thread block as a schedul- 
ing unit, instead of considering individual warps. In 
this way, the computational complexity is significantly 
reduced, and time cost of our model is negligible to the 
GPU kernel execution time. 

The second issue is on getting the input for the model. 
Our current approach is based on hardware profiling of 
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TABLE 2: GPU configurations. 





C2050 


GTX680 


Architecture 


Fermi GF110 


Kepler GK104 


Number ot SMs 


14 


8 


Number of cores per SM 


32 


192 


Core frequency (MHz) 


1147 


706 


Global memory size (MB) 


3072 


2048 


Global memory bandwidth (GB/s) 


144 


192 



a small number of thread blocks from a single kernel. 
Thus, the pre-execution is only a very small part of 
the kernel execution. From profiling, we can obtain the 
number of memory instructions issued and the total 
number of instructions executed, and calculate Rm as 
their ratio. 

In summary, our probabilistic model has captured the 
inherent non-determinism in concurrent kernel execu- 
tions. First, it simply requires only a small set of profiling 
inputs on the memory and computation characteristics 
of individual kernels. Second, with careful probabilistic 
modeling, we develop a performance model that is 
sufficiently accurate to guide our scheduling decision. 
The effectiveness of our model will be evaluated in the 
experiments (Section [5). 

5 Evaluation 

In this section, we present the experimental results on 
evaluating Kernelet on latest GPU architectures. 

5.1 Experimental Setup 

We have conducted experiments on a workstation 
equipped with one NVIDIA Tesla C2050 GPU, one 
NVIDIA GTX680 GPU, two Intel Xeon E5645 CPUs and 
24GB RAM. Table shows some architectural features of 
C2050 and GTX680. We note that C2050 and GTX680 are 
based on Fermi and Kepler architectures, respectively. 
One C2050 SM has two warp schedulers, and each can 
serve half a warp per cycle (with a theoretical IPC of 
one). In contrast, one GTX680 SMX features four warp 
schedulers and each warp scheduler can serve one warp 
per cycle (with a theoretical IPC of eight considering its 
dual-issue capability). Our implementation is based on 
GCC 4.6.2 and NVIDIA CUDA toolkit 4.2. 

Workloads. We choose eight benchmark applications 
with different memory and computation intensivenesses. 
Sources of the benchmarks include the CUDA SDK, the 
Parboil Benchmark fl3l . the CUSP library [4| and our 
home grown applications. Table describes the details 
of each application, including input settings and thread 
configurations of the most time-consuming kernel on 
C2050. 

Table |4] shows the memory and computation char- 
acteristics of the most time-consuming kernel of each 
application on both C2050 and GTX680. We observed 
that the PUR/MUR values are stable as we vary the 
input sizes (as long as the input size is sufficiently large 
to keep the GPU occupancy high). 



TABLE 3: Specification of benchmark applications and 
thread configuration (#threads per thread block x 
#thread blocks). 



Name 


Description 


Input settings 


Thread configu- 
ration on C2050 


Pointer Chas- 
ing (PC) 


Traversing an 
array randomly 


Index values for 40 
million accesses 


256 x 16384 


Sum of Ab- 
solute Differ- 
ences (SAD) 


An operation 
used in Ml EG 
encoding 


Image with 1920 x 
1072 pixels 


32 x 8048 


Sparse 

Matrix Vector 
Multiplica- 
tion (SPMV) 


Multiplying a 
sparse matrix 
with a dense 
vector. 


A 131072x81200 
matrix with 16 
non-zero elements 
per row on average 


256 x 16384 


Stencil (ST) 


Stencil 

operation on a 
regular 3-D grid 


3D grid with 
134217728 points 


128 x 16384 


Matrix Mul- 
tiplication 
(MM) 


Multiplying two 
dense matrices 


One 8192 x 2048 
matrix, the other 
2048 x 2048 


256 x 16384 


Magnetic 
Resonance 
Imaging - Q 
(MRIQ) 


A matrix opera- 
tion in magnetic 
resonance imag- 
ing 


2097152 elements 


256 x 8192 


Black Scholes 
(BS) 


Black-Scholes 
Option Pricing 


40 million 


128 x 16384 


Tiny 

Encryption 
Algorithm 
(TEA) 


A thread block 
cipher 


20971520 elements 


128 x 16384 



TABLE 4: Memory and computational characteristics of 
benchmark applications. 



Benchmarks 


C2050 


GTX680 


PUR 


MUK 


Occupancy 


PUR 


MUR 


Occupancy 


PC 


0.0096 


0.1404 


100% 


0.0072 


0.1746 


100% 


SAD 


0.1498 


0.1120 


16.7% 


0.1062 


0.1351 


25% 


SPMV 


0.3464 


0.003 


100% 


0.3027 


0.0043 


100% 


ST 


0.3629 


0.1156 


66.7% 


0.2016 


0.1179 


100% 


MM 


0.5804 


0.0161 


67.7% 


0.5321 


0.0569 


100% 


MPJQ 


0.8539 


0.0002 


83.3% 


1.6784 


0.0007 


100% 


BS 


0.8642 


0.0604 


67.7% 


1.2007 


0.1323 


100% 


TEA 


0.9978 


0.0196 


67.7% 


1.1417 


0.0353 


100% 



To assess the impact of kernel scheduling under dif- 
ferent mixes of kernels, we create four groups of kernels 
namely CI, MI, MIX and ALL (as shown in Table 0. 
CI represents the computation-intensive workloads in- 
cluding kernels with high PUR, whereas MI represents 
workloads with intensive memory accesses. MIX and 
ALL include a mix of CI and MI kernels. ALL has 
more kernels than MIX. In each workload, we assume 
the application arrival conforms to Poisson distribution. 
The parameter A in the Poisson distribution affects the 
weight of the application in the workload. For simplicity, 
we assume that all application has the same A. We 
also assume A is sufficiently large so that at least two 
kernels are pending for execution at any time for a high 
utilization of the GPU. 

Comparisons. To evaluate the effectiveness of kernel 
scheduling in Kernelet, we have implemented the fol- 

TABLE 5: Workload configurations. 



Workload 


Applications 


CI 


BS, MM, TEA, MRIQ 


MI 


PC, SPMV, ST, SAD 


MIX 


PC, BS, TEA, SAD 


ALL 


PC, SPMV, ST, BS, MM, TE, MRIQ, SAD 
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-SAD 




(a) C2050 (b) GTX680 

Fig. 6: Sliced execution overhead with varying slice size on both C2050 and GTX680. 



lowing scheduling techniques: 

• Kernel Consolidation (BASE): the kernel consolida- 
tion approach of concurrent kernel execution (34j. 

• Oracle (OPT): OPT uses the same scheduling al- 
gorithm as Kernelet, except that it pre-executes all 
possible slice ratios for all combinations to obtain 
the CP and then determines the best slice ratio and 
kernel combination. In another word, OPT is an 
offline algorithm and provides the optimal through- 
put for the greedy scheduling algorithm. 

> Monte Carlo-based co-schedule (MC): we develop 
a Monte Carlo approach to generate the distribution 
of performance of different co-schedule methods in 
the solution space. In each Monte Carlo simulation, 
we randomly pick the kernel pairs and slice ra- 
tios for co-scheduling. Through many Monte Carlo 
simulations, we can quantitatively understand how 
different co-schedules affect the performance. We 
denote the result of MC to be MC(s), where s is 
the number of Monte Carlo simulations. 

5.2 Results on Kernel Slicing 

We first evaluate the overhead of sliced execution, which 
is defined as ^ — 1, where T s and T ns are the sliced and 
unsliced execution time, respectively. 

Figure [6] shows the overhead for executing individual 
kernels with varying slice sizes on C2050 and GTX680. 
Slice sizes are set to multiples of the number of SMs 
on the GPU and ranges from \SM\ to the maximum 
number under the occupancy limit. Overall, as the slice 
size increases, the slicing overhead decreases. However, 
we observe quite different performance behaviors on 
C2050 and GTX680, due to their architectural differences. 
On C2050, when the size is small, the slicing overhead is 
very high (up to 66.7% for SAD). When the slice is larger 
than or equal to 42 (three thread blocks per SM), the 
overhead is ignorable for most kernels. Sliced execution 
overhead on GTX680 is much smaller than on C2050. 
Almost all slice sizes lead to overhead less than 2% on 
GTX680. Regardless of the architectural differences, the 
ignorable overhead of kernel slicing allows us to exploit 
kernel slicing for co-scheduling slices from different 
kernels with little additional cost. 
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Fig. 7: Comparison between predicted and measured 
single kernel execution IPCs on two GPUs. 



5.3 Results on Model Prediction 

We evaluate the accuracy of our performance model in 
different aspects, including the estimation of IPCs for 
single kernels and concurrent kernel executions, and CP 
prediction for concurrent kernel executions. 

Single Kernel Performance Prediction. Figure com- 
pares the measured and estimated IPC values for the 
eight benchmark applications on C2050 and GTX680. We 
also show the two lines ( y = x ± 0.2 for C2050 and 
y = x± 1.6 for GTX680) to highlight the scope where dif- 
ference between measurement and estimation is within 
± 20% of the peak IPC. Note, the theoretical IPCs for 
C2050 and GTX680 are one and eight respectively. If 
the result falls in this scope, we consider the estimation 
well captures the trend of the measurement. We can 
see that, most results are within the scope. We further 
define the absolute error to be e — e'|, where e and e' are 
the measured and estimated IPC values, respectively. 
The average absolute error for the eight benchmark 
applications is 0.08 and 0.21 on C2050 and GTX680, 
respectively. Our probabilistic model has achieved a 
reasonable accuracy in estimating the performance of 
single-kernel executions on the GPU. 

Concurrent Kernel Performance Prediction. For the 
eight benchmark applications, we run every possible 
combination of kernel pairs and measure the IPC for 
each combination. Figure |8] compares the measured and 
predicted IPCs with the suitable slice ratio given by our 
model. We have also studied other slicing ratios. Figured 
compares the measured and predicted IPCs with fixed 
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Fig. 8: Comparison between predicted and measured 
concurrent kernel execution IPCs on two GPUs with 
optimal slice ratio. 
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Fig. 9: Comparison between predicted and measured 
concurrent kernel execution IPCs on two GPUs with 
fixed one-to-one slice ratio. 
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concurrent kernel execution IPCs with/ without consid- 
ering uncoalesced access on C2050. 
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Fig. 11: Comparison between predicted and measured 
concurrent kernel execution IPCs without considering 
multiple warp schedulers on GTX680. 



ratio = 1:1. We observed similar results on other fixed 
ratios. Regardless of different kernel combinations and 
slicing ratios, our model is able to well capture the trend 
of concurrent executions for both dynamic and static 
slice ratios. 

Model Optimizations. We further evaluate the im- 
pact of incorporating coalesced /uncoalesced memory 
accesses and the number of warp schedulers on the GPU. 
Only two applications (PC and SPMV) in our bench- 
mark have uncoalesced memory accesses. We conduct 
the single kernel execution prediction experiments by 
(wrongly) assuming those two kernels with coalesced 
memory accesses only. The results are shown in Fig- 
ure [TUJ Without considering uncoalesced access, the pre- 
dicted IPC values are much larger than measurements 
since the assumption of coalesced access only underes- 
timates the memory contention effects. 

Figure [TT] shows the results of concurrent execution 
IPC prediction on GTX680 without considering the mul- 
tiple warp schedulers. The estimation without consider- 
ing the number of warp schedulers severely underesti- 
mates the IPC on GTX680, in comparison with the results 
in Figure [8] 

CP Prediction. We further evaluate the accuracy of 
CP prediction. Figure [12] shows the comparison between 
measured and predicted CP on C2050. We observe sim- 
ilar results on GTX680. The prediction is close to the 
measurement. With accurate prediction on IPC , the CP 
difference between prediction and measurement is small. 



The results are sufficiently good to guide the scheduling 
decision as shown in the next section. 

5.4 Results on Kernel Scheduling 

In this section, we evaluate the effectiveness of our 
kernel scheduling algorithm by comparing with BASE 
and OPT. To simulate the continuous kernel submission 
process, we initiate 1000 instances for each of each 
kernel mix and submit them for execution according 
to Poisson distributions. Different scheduling algorithms 
are applied and the total kernel execution time is re- 
ported. Figure [13] shows the total execution time of those 
kernels on C2050 and GTX680. On all the four workloads 
with different memory and computation characteristics, 
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TABLE 6: Number of kernels pruned with varying a. 
and a m on C2050. 



a p 


0.1 


0.2 


0.3 


0.4 


0.5 


0.6 


0.7 


0.8 


0.9 


1.0 


0.015 








2 


2 


3 


4 


4 


4 


4 


4 


0.03 





2 


5 


6 


7 


8 


9 


9 


9 


9 


0.045 





3 


7 


8 


9 


10 


11 


11 


11 


11 


0.06 


1 


4 


8 


9 


12 


13 


15 


15 


15 


15 


0.075 


1 


4 


8 


9 


12 


13 


15 


15 


15 


15 


0.09 
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12 


13 


15 


15 
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Kernelet outperforms Base (with the improvement 5.0- 
31.1% for C2050 and 6.7-23.4% for GTX680). Kernelet 
achieves similar performance to OPT (with the difference 
0.7-3.1% for C2050 and 4.0-15.0% for GTX680). The 
performance improvement of Kernelet over Base is more 
significant on MIX and ALL, because Kernelet have more 
chances to select kernel pairs with complementary re- 
source usage. Still, Kernelet outperforms Base in CI and 
MI, because slicing exposes the scheduling opportunities 
(even though they are small on CI and MI). 

Table [6] shows the number of kernels pruned with 
different pruning parameters a p and a m on C2050. In- 
creasing a p and a m leads to more kernel combinations 
being pruned. Similar pruning result is oberserved for 
GTX680. Varying those two parameters can affect the 
pruning power and also the optimization opportunities. 
Thus, we choose the default values for a p and a m as 
0.4 and 0.1 on C2050, and 0.4 and 0.105 on GTX680, 
respectively, as a tradeoff between pruning power and 
optimization opportunities. 

We finally study the execution time distribution of the 
scheduling candidate space. Figure [14] shows the CDF 
(cumulative distribution function) of the execution time 
of the MC(1000). As we can see from the figure, none of 
the random schedules is better than Kernelet. It demon- 
strates that random co-schedules hurt the performance 
in a high probability due to the huge space of schedule 
plans. 




2665000 2670000 2675000 2680000 2685000 2690000 2695000 
Execution Time (ms) 

Fig. 14: CDF (cumulative distribution function) of exe- 
cution time of MC(1000). 

6 Related Work 

In this section, we review the related work in two cate- 
gories: 1) scheduling algorithms on CPUs, especially for 
CPUs with SMT (Simultaneous Multi-Threading), and 2) 
multi-kernel executions on GPUs. 

6.1 CPU Scheduling and Performance Modeling 

SMT has been an effective hardware technology to 
better utilize CPU resources. SMT allows instructions 
from multiple threads to be executed on the instruction 
pipeline at the same time. Various SMT aware schedul- 
ing techniques have been proposed to increase the CPU 
utilization |8), (2D, |39l-|I2). The core idea of SMT aware 
scheduling is to co-schedule threads with complemen- 
tary resource requirements. Several mechanisms have 
been proposed to select the threads to be executed on 
the same SMT core, such as hardware counters feed- 
back (HI, Hll, pre-execution of all combinations [39J 
and probabilistic job symbiosis model [8 J. Performance 
models including the Markov chain based ones have 
also been adopted for concurrent tasks modeling on 
the CPUs. Serrano et al. [36], |37| developed a model 
to estimate the instruction throughput of super-scalar 
processors executing multiple instruction streams. Chen 
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et al. [§1 proposed an analytical model for estimating 
throughput of multi- threaded CPUs. 

Despite the fruitful research results on SMT schedul- 
ing, they are not applicable to GPUs due to the architec- 
ture differences. First, main performance affecting issues 
are different for CPU and GPU applications. L2 data 
cache is a key consideration issue for SMT-aware thread 
scheduling, whereas thread parallelism is usually more 
important for the performance of GPGPU programs [2], 
l[T9l , (38| . Second, scheduling on GPUs is not as flexible 
as that on CPUs. Current GPUs do not support task 
preemption. Third, unlike CPUs supporting the concur- 
rent execution of a relatively small number of threads, 
each GPGPU kernel launches thousands of threads. 
Additionally, the maximum number of co-scheduling 
threads equals to the number of hardware context on 
the CPU, while the number of active warps on GPUs is 
dynamic, depending on the resource usage of the thread 
blocks. The slicing, scheduling and performance models 
in Kernelet are specifically designed for GPUs, taking 
those issues into consideration. 

6.2 GPU Multiple Kernel Execution and Sharing 

In the past few years, GPU architectures have under- 
gone significant and rapid improvements for GPGPU 
support. Due to lack of concurrent kernel support in 
early GPU architectures, researchers initially proposed 
to merge two kernels at the source code level [12[, [14|. 
In those methods, two kernels are combined into a single 
kernel with if-else branches on different granularities 
(e.g., thread blocks). They have three major disadvan- 
tages compared with our approach. First, combining the 
code of two kernels will increase the resource usage of 
each thread block, leading to lower SM occupancy and 
performance degradation [29]. Second, those approaches 
require source code, which may not be always available 
in the shared environments. Third, it requires two ker- 
nels with different block sizes avoiding using barriers 
within the thread block, otherwise deadlock may occur. 

Recently, new-generation GPUs like NVIDIA Fermi 
GPUs support concurrent kernel executions. Taking ad- 
vantage of this new capability, a number of multi- 
kernel optimization techniques [15], [23 1, [35] have been 
developed to improve the utilization of GPUs. Ravi et 
al. l34l proposed kernel consolidation to enable space 
sharing (different kernels run on different SMs) and time 
sharing (multiple kernels reside on the same SM) on 
GPUs. Space sharing happens when the total number of 
thread blocks of all kernels does not exceed the number 
of SMs and each block can be executed on a dedicated 
SM. If the total number of thread blocks is larger than the 
number of SMs, while SMs have sufficient resources to 
accommodate more thread blocks from different kernels, 
time sharing happens. That means, kernel consolidation 
does not have space sharing and have little time sharing 
when the launched kernels have sufficient thread blocks 
to occupy the GPU. Furthermore, they determined the 



kernel to be consolidated with heuristics based on the 
number of thread blocks. In contrast, Kernelet utilizes 
slicing to create more opportunities for time sharing, and 
develops a performance model to guide the scheduling 
decision. Peters et al. ||32l used a persistently running 
kernel to handle requests from multiple applications. 
GPU virtualization has also been investigated [15], [23]. 

Recent studies also address the problem of GPU 
scheduling when multiple users co-reside in one ma- 
chine. Pegasus |[T6l coordinates computing resources like 
accelerators and CPU and provides a uniform resource 
usage model. Timegraph If22ll and PTask 11351 manage the 
GPU at the operating system level. Kato et al. [2"T| intro- 
duced the responsive GPGPU execution model (RGEM). 
All those scheduling methods do not consider how to 
schedule concurrent kernels in order to fully utilize the 
GPU resources. 

As for performance models on GPUs, Hong |[T9ll 
and Kim [38] proposed analytical models based on the 
round-robin warp scheduling assumption. Baghsorkhi et 
al. introduced the work flow graph interpretation 
of GPU kernels to estimate their execution time. All 
those models are designed for a single kernel. More- 
over, they usually require extensive hardware profiling 
and/ or simulation processes. In contrast, our perfor- 
mance model is designed for concurrent kernel execu- 
tions on the GPU, and relies on a small set of key 
performance factors of individual kernel to predict the 
performance of concurrent kernel executions. 

7 Conclusion 

Recently, GPUs have been more and more widely used 
in clusters and cloud environments, where many kernels 
are submitted and executed on the shared GPUs. This 
paper proposes Kernelet to improve the throughput 
of concurrent kernel executions for such shared envi- 
ronments. Kernelet creates more sharing opportunities 
with kernel slicing, and uses a probabilistic performance 
model to capture the non-deterministic performance fea- 
tures of multiple-kernel executions. We evaluate Kernelet 
on two NVIDIA GPUs, Tesla C2050 and GTX680, with 
Fermi and Kepler architectures respectively. Our exper- 
iments demonstrate the accuracy of our performance 
model, and the effectiveness of Kernelet by improving 
the concurrent kernel executions by 5.0-31.1% and 6.7- 
23.4% on C2050 and GTX680 on our workloads, respec- 
tively. 
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